Data Analytics and Visualisation

Slides

About Me

  • Vivek Katial (vivek@gooddatainstitute.com)
    • Co-founder and Executive Director @ Good Data Institute
    • Lead Data Scientist & Founding Team @ Multitudes
    • QC PhD / visiting PhD @ NASA Jet Propulsion Lab

Who is the Good Data Institute?

  • We are an NFP that empowers other NFPs to build data capabilities.
  • Our volunteer community does this by working on bespoke data projects for our NFP partners
  • We are a global community of 150+ data nerds

Our Impact

  1. 75+ Data Projects
  2. 50+ Nonprofit Partners
  3. 150+ Data Nerds
  4. 10+ Countries
  5. 3000+ Volunteer Hours

Today’s Agenda

  1. Importance of Data Analytics and Visualisation
  2. Key Tools for Nonprofits
  3. Data Ethics and Algorithmic Bias
  4. Case Studies
  5. Best Practices in Data Modeling and Visualisation
  6. Q&A

Importance of Data Analytics and Visualisation

Data Analytics

What is data analytics

  • Who has heard of ETL?

Extract
Transform
Load

What is data analytics (Extraction)?

  • Extraction → Data Collection
  • Gathering data from multiple sources
    • Application / website data / APIs
    • Forms and surveys data (SurveyMonkey, Google Forms)
    • Live datafeeds (e.g. real time data from sensors)
    • Spreadsheet or CSV files
  • Needs to be robust, reliable and scalable
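As a minimal sketch of what a more robust extraction step could look like, the snippet below reads a CSV export with explicit column types and fails early when an expected column is missing. The column names and the inline sample data are hypothetical stand-ins for a real survey export.

```r
library(readr)

# Hypothetical survey export; in practice this would be a file path or URL.
raw_csv <- I("respondent_id,postcode,response_date\n1,3000,2024-01-15\n2,3053,2024-01-16\n")

# Columns every export is expected to contain (an assumption for this sketch).
expected_cols <- c("respondent_id", "postcode", "response_date")

read_survey <- function(source) {
  d <- read_csv(
    source,
    col_types = cols(
      respondent_id = col_integer(),
      postcode      = col_character(),  # text, so leading zeros survive
      response_date = col_date(format = "%Y-%m-%d")
    )
  )
  # Fail loudly rather than carrying a malformed export downstream.
  missing_cols <- setdiff(expected_cols, names(d))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  d
}

d_survey <- read_survey(raw_csv)
```

Declaring the types up front means a changed export format surfaces as an error at load time instead of as a subtle problem in a later chart.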

What is data analytics (Transform)?

  • Transformation → Data Cleaning + Enrichment and Aggregation
  • Cleaning – Removing missing values, duplicates, and outliers, checking for data consistency. Running unit tests!
  • Enrichment – Adding contextual information to the data; e.g. geocoding, demographic data, other datasets from ABS
  • Aggregation – Creating summaries of the data that can be used for analysis or shared with stakeholders (especially from an ethics perspective)
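The three transform steps above can be sketched in a few lines of dplyr. All data here is made up for illustration: `d_raw`, `d_regions` and their columns are hypothetical, and the lookup table stands in for an external dataset such as one sourced from the ABS.

```r
library(dplyr)

# Hypothetical raw records with one duplicate row and one missing postcode.
d_raw <- data.frame(
  client_id = c(1, 2, 2, 3, 4),
  postcode  = c("3000", "3053", "3053", NA, "3000"),
  visits    = c(2, 5, 5, 1, 3)
)

# Hypothetical lookup table used for enrichment.
d_regions <- data.frame(
  postcode = c("3000", "3053"),
  suburb   = c("Melbourne", "Carlton")
)

# Cleaning: drop exact duplicates and rows with no postcode.
d_clean <- d_raw %>%
  distinct() %>%
  filter(!is.na(postcode))

# Enrichment: attach suburb names from the lookup table.
d_enriched <- d_clean %>%
  left_join(d_regions, by = "postcode")

# Aggregation: suburb-level totals that can be shared without exposing individuals.
d_summary <- d_enriched %>%
  group_by(suburb) %>%
  summarise(n_clients = n(), total_visits = sum(visits), .groups = "drop")

# A check in the spirit of a unit test: no client appears twice after cleaning.
stopifnot(!any(duplicated(d_clean$client_id)))
```

Sharing `d_summary` rather than `d_enriched` is one way to act on the ethics point: stakeholders get the pattern without the individual records.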

What is data analytics (Load)?

  • Loading → Data Storage, Governance and Security
  • Storage – Storing data in a secure and accessible way
  • Governance – Ensuring data is used ethically and responsibly
  • Security – Protecting data from unauthorized access or misuse

Data Visualisation

  • Data Visualisation → Communicating insights truthfully and with beauty

  • Truthfully – Representing data accurately and without bias; avoiding misleading visualisations

  • Beauty – Making data engaging, emphasising key points, and telling a story. Providing context and making it easy to understand

Example - Climate & Conflict

Key Tools for Nonprofits

What tools do you use?

  • Poll: What tools do you use for data analytics and visualisation?
    • Microsoft Excel
    • Google Sheets
    • R and Python
    • Looker Studio
    • Tableau
    • Power BI
    • Custom Dashboards

Tools and Technologies

  • Basic Tools (Google Sheets, Microsoft Excel)
  • Looker Studio
  • Tableau Nonprofit Program
  • Microsoft Power BI
  • Free and Open Source Tools (R, Python libraries)
  • Custom Dashboards and Reports from Salesforce, etc.

How to Choose the Right Tool

  • Where is your data stored already?
  • What are your data visualisation needs?
  • What is your budget and technical capacity?

Pros and Cons of Different Tools

  • Microsoft Excel and Google Sheets: Easy to use and widely available, but limited in features, and not scalable, reliable or reproducible
  • R and Python: Highly customizable, but require coding skills and technical expertise
  • Looker Studio: Easy to use, but limited customization
  • Tableau and Power BI: Powerful features, but can be expensive
  • Custom Dashboards via ERPs: Tailored to your needs, but require development resources

Data Ethics and Algorithmic Bias

What is Data Ethics and Algorithmic Bias?

  • Data ethics refers to the moral and ethical implications of data collection, analysis, and use.
  • Algorithmic bias refers to the tendency of algorithms to systematically and repeatedly produce outcomes that benefit one particular group over another
  • There are already many examples in society where algorithms have harmed marginalised groups!

Trivial Example

  • Predictions on the image of the Western bride included labels such as “bride”, “wedding”, “ceremony”
  • For the woman wearing a traditional Indian wedding dress, the predicted labels were “costume”, “performing arts”, “event”

More Harmful Example

  • Evaluation of a model that uses facial recognition deployed by large technology companies

\[ P(\text{Dark}) < P(\text{Light}) \]

\[ P(\text{Dark} \cap \text{Male}) < P(\text{Dark} \cap \text{Female}) < P(\text{Light} \cap \text{Female}) < P(\text{Light} \cap \text{Male}) \]
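One way to surface a gap like this is to compute accuracy separately for each subgroup instead of reporting a single overall number. The sketch below does that on a small made-up set of results; the data, column names and resulting figures are purely illustrative and are not taken from the evaluation discussed here.

```r
library(dplyr)

# Made-up evaluation results: one row per test image (illustrative only).
d_eval <- data.frame(
  skin_tone = rep(c("Dark", "Light"), each = 4),
  gender    = rep(c("Female", "Male"), times = 4),
  correct   = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# A single overall figure hides who the model fails for.
overall_accuracy <- mean(d_eval$correct)

# Accuracy broken down by subgroup makes the disparity visible.
accuracy_by_group <- d_eval %>%
  group_by(skin_tone, gender) %>%
  summarise(n = n(), accuracy = mean(correct), .groups = "drop")

print(overall_accuracy)
print(accuracy_by_group)
```

Reporting the per-group table alongside the overall figure is a cheap habit that makes this kind of bias much harder to miss.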

  • What happens when you try to use a de-biasing parameter \(\alpha\) to reduce that bias?

How to get started on data ethics!

  • Create some data principles
  • This is a well-studied field. Get buy-in from leadership using existing research.
    • Timnit Gebru, CAIDE, AJL, etc.
  • Conduct a bad actor exercise to identify potential risks
  • Recognise that humans are the ones who create algorithms, so the broader culture and environment we create and operate in matters too.
  • Commit to learning more!
    • Weapons of Math Destruction by Cathy O’Neil
    • Data Feminism by Catherine D’Ignazio and Lauren F. Klein

Example 1

Caring Kids Australia

  • Mission: To provide toy boxes to kids who support their family members facing chronic illnesses or disabilities
  • Data Challenge: They had a database of all the kids they had helped, including addresses. They wanted to know where the kids were located and how they could better serve them.

Example 2

Where are all the GDI projects located?

  • Write a basic SQL query to download all project data and write to csv
SELECT
  project_id,
  project_name,
  charity_name,
  hq_country,
  hq_city,
  gdi_branch
FROM gdi_db.projects;
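A hedged sketch of how a query like the one above could be run from R and written to CSV, using the DBI and RSQLite packages. The in-memory SQLite database and its single sample row are stand-ins for the real `gdi_db`, whose engine and connection details are not given here, so the table is referenced simply as `projects`.

```r
library(DBI)

# Stand-in database: an in-memory SQLite instance with one made-up row.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "projects", data.frame(
  project_id   = 1L,
  project_name = "Example Project",
  charity_name = "Example Charity",
  hq_country   = "Australia",
  hq_city      = "Melbourne",
  gdi_branch   = "Melbourne"
))

query <- "
  SELECT project_id, project_name, charity_name, hq_country, hq_city, gdi_branch
  FROM projects
"

# Run the query, then close the connection once the data is in memory.
d_export <- dbGetQuery(con, query)
dbDisconnect(con)

# Written to a temporary folder here; a real script might target data/projects.csv.
out_path <- file.path(tempdir(), "projects.csv")
write.csv(d_export, out_path, row.names = FALSE)
```

Keeping the query in a script, rather than exporting by hand, means the CSV can be regenerated whenever the underlying data changes.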

Real Example (using R)

library(tidyverse)
d_projects <- read_csv("data/projects.csv")
  • Great! Now let’s look at one row of the data

d_projects %>% 
  slice(29) %>% 
  glimpse()
Rows: 1
Columns: 6
$ project_id   <int> 29
$ project_name <chr> "Biden-Harris Transition DEI"
$ charity_name <chr> "Inclusive America"
$ hq_country   <chr> "United States"
$ hq_city      <chr> "Washington DC"
$ gdi_branch   <chr> "Melbourne"

d_projects %>% 
  count(hq_country) 
# A tibble: 11 × 2
   hq_country         n
   <chr>          <int>
 1 Australia         39
 2 Colombia           1
 3 India              1
 4 Indonesia          1
 5 New Zealand       19
 6 Spain              2
 7 Taiwan             1
 8 Thailand           1
 9 Uganda             2
10 United Kingdom     2
11 United States     11

  • What do you think of this visualisation (a pie chart of projects by country)?

Proper Data Visualisation

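# Count projects per country and plot them as bars, largest first
d_projects %>% 
  count(hq_country) %>% 
  ggplot(aes(x = reorder(hq_country, -n), y = n)) + 
  geom_col() +  # same result as geom_bar(stat = "identity")
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # angled labels avoid overlap
  labs(
    x = "Country",
    y = "Number of Projects"
  )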

Comparison

  • Pie Chart 🥧
    • Hard to distinguish between the parts of a circle
    • So many colors, hard to process
    • Not the best choice for this data

  • Bar Chart 📊
    • Easier to read and understand
    • Reordered by number of projects
    • Clear labels

Question: What’s missing from both?

More advanced visualisation

What might this visualisation struggle to communicate?

Example 3

Pipeline Dreams with

  • Infectious diseases are responsible for 6 of the top 10 causes of death in the Global South
  • Of the drugs in development, 10% are targeting these disease areas
  • Low-income countries produce less than 5% of the world’s scientific research

Scientific Method

Design

Test

Evaluate

AI can help and accelerate research

  • AI can help to accelerate research and development
  • Accelerate novel drug discovery in low-income countries by providing scientists in these areas with cutting-edge AI models
  • Ersilia’s model hub currently has 100+ models, each tailored for a unique aspect of drug discovery.

Problem

  • Limited compute capacity available to researchers in the Global South means using these models is slow
  • We want to build a database of pre-calculated ML predictions for commonly used molecules (reference library of 2M)
  • We should be able to return these predictions over the internet within seconds (not minutes)

Solution

Lessons Learned

  • Outcome: We can now return predictions in under 1 second
  • Technical domains take time to understand
  • Using an evolving architecture diagram can facilitate engineering work
  • infrastructure_as_code.(Terraform) == Good

Wrapping up

Best Practices in Data Analytics and Visualisation

  1. Start with a clear goal
  2. Understand your data
  3. Choose the right tool(s)
    • If you can write SQL or use Excel, you can write R or Python
    • Use scripts to automate repetitive tasks and invest in version control (e.g. GitHub)
  4. Use the right visualisation for your data
    • Avoid pie charts
    • Use color and size to draw attention to key points
    • Make it easy to understand
    • Enhance with exogenous datasets (e.g. geocoding, census data)
  5. Never forget Data Ethics and Algorithmic Bias
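As an illustration of the point about scripting repetitive tasks, the base R sketch below wraps one summary step in a function and applies it to every CSV in a folder. The folder, file names and sample rows are hypothetical and are created on the fly in a temporary directory so the example runs on its own.

```r
# Summarise one CSV: number of rows per value of a chosen column.
summarise_file <- function(path, column) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  counts <- as.data.frame(table(d[[column]]), stringsAsFactors = FALSE)
  names(counts) <- c(column, "n")
  counts$source_file <- basename(path)
  counts
}

# Create two small hypothetical monthly exports in a temporary folder.
export_dir <- file.path(tempdir(), "exports")
dir.create(export_dir, showWarnings = FALSE)
write.csv(data.frame(hq_country = c("Australia", "Australia", "Spain")),
          file.path(export_dir, "2024-01.csv"), row.names = FALSE)
write.csv(data.frame(hq_country = c("Uganda", "Australia")),
          file.path(export_dir, "2024-02.csv"), row.names = FALSE)

# Apply the same step to every file instead of repeating it by hand.
files <- list.files(export_dir, pattern = "\\.csv$", full.names = TRUE)
report <- do.call(rbind, lapply(files, summarise_file, column = "hq_country"))
print(report)
```

Once a step lives in a function and the script is under version control, next month's report is a re-run rather than a rebuild.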

Q&A

Follow us

Map code

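library(tidyverse)  # dplyr and ggplot2 are used below
library(maps)       # supplies the polygons behind map_data("world")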

# Map data preparation with country name adjustments
d_projects <- d_projects %>%
  mutate(hq_country = case_when(
    hq_country == "United States" ~ "USA",
    hq_country == "United Kingdom" ~ "UK",
    TRUE ~ hq_country
  ))


# Load world map data
world_map <- map_data("world")

# Join your project data and prepare the map data
# Named d_map so it does not shadow ggplot2's map_data() function.
# Countries with no projects keep NA after the join and are drawn with na.value.
d_map <- world_map %>%
  left_join(
    d_projects %>% count(hq_country, name = "n_projects"),
    by = c("region" = "hq_country")
  )

# Plotting the map
ggplot(d_map, aes(x = long, y = lat, group = group, fill = n_projects)) +
  geom_polygon(color = "#1C1C1C", linewidth = 0.15) +  # Thin dark borders; use size instead of linewidth before ggplot2 3.4.0
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Number of Projects", na.value = "#313131") +
  labs(title = "Number of Projects by Headquarters Country", x = "", y = "") +
  theme_void() + 
  theme(
    text = element_text(color = "white"),  # Changes text color to white
    plot.background = element_rect(fill = "black", color = NA),  # Dark plot background
    panel.background = element_rect(fill = "black", color = NA),  # Dark panel background
    panel.grid.major = element_blank(),  # No major grid
    panel.grid.minor = element_blank(),  # No minor grid
    plot.title = element_text(color = "white", hjust = 0.5),  # Title in white and centered
    axis.text = element_blank(),  # Remove axis text
    axis.ticks = element_blank(),  # Remove axis ticks
    legend.background = element_rect(fill = "black", color = NA),  # Dark legend background
    legend.text = element_text(color = "white")  # White legend text
  )
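Because the join above matches on country names, any spelling mismatch silently drops a country from the map. A quick check with `anti_join()` can list project countries that have no matching map region before plotting. The two small data frames below are hypothetical stand-ins for `d_projects` and the `region` column of the world map data.

```r
library(dplyr)

# Hypothetical project countries, before any name adjustments.
project_countries <- data.frame(
  hq_country = c("Australia", "United States", "United Kingdom", "New Zealand")
)

# A few region names as they would appear in the map data (illustrative subset).
map_regions <- data.frame(region = c("Australia", "USA", "UK", "New Zealand"))

# Project countries with no matching region; these would vanish from the map.
unmatched <- project_countries %>%
  anti_join(map_regions, by = c("hq_country" = "region"))

print(unmatched)
```

Any rows returned here are candidates for another `case_when()` rule in the name-adjustment step.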